library(tidyverse)
library(broom)
library(mice)
<- 1000
n
set.seed(928)
<- tibble(
data c_x = rnorm(n, sd = 0.71),
x = c_x + rnorm(n, sd = 0.71),
c_y = rnorm(n),
y = x + c_y + rnorm(n),
noise = rnorm(n),
x_miss = rbinom(n, 1, 1 / (1 + exp(-(c_x + c_y)))),
x_obs = ifelse(
x_miss,NA,
x
) )
Are two wrong models better than one? Missing data style
After my previous post about missing data, Kathy asked on Twitter whether two wrong models (the imputation model + the outcome model) would be better than one (the outcome model alone).
Without doing any of the math, I’d guess the assumption of correctly spec the model also has a bigger impact in the CC analysis.
You need correct spec in MI, twice, but trade off that potential bias for higher prec.
This is a great question! I am going to investigate via a small simulation (so the answer could be “it depends”, but at least we will know how it seems to work in this very simple case) 😆.
Ok so here I have some predictor, x
that is missing 50% of the time, dependent on c_x
and c_y
. The right imputation model would have c_x
, the right outcome model needs c_y
. Unfortunately, we only have access to one, which we will try to use in our imputation model (and outcome model). Let’s see whether two (wrong) models are better than one!
A “correct” model will be one that estimates that the coefficient for x
is 1.
We only have c_x
Ok first let’s look at the whole dataset.
<- lm(y ~ x + c_x, data = data) |>
mod_full_c_x tidy(conf.int = TRUE) |>
filter(term == "x") |>
select(estimate, conf.low, conf.high)
mod_full_c_x
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 1.01 0.877 1.14
This checks out! c_x
basically does nothing for us here, but because c_y
is not actually a confounder (it just informs the missingness & y
, which we aren’t observing here), we are just fine estimating our “wrong” model in the fully observed data. Now let’s do the “complete cases” analysis.
<- na.omit(data)
data_cc <- lm(y ~ x + c_x, data = data_cc) |>
mod_cc_c_x tidy(conf.int = TRUE) |>
filter(term == "x") |>
select(estimate, conf.low, conf.high)
mod_cc_c_x
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 0.999 0.812 1.19
This does fine! Now let’s do some imputation. I am going to use the mice
package.
<- mice(
imp_data_c_x
data, m = 5,
method = "norm.predict",
formulas = list(x_obs ~ c_x),
print = FALSE)
Ok let’s compare how this model does “alone”.
<- with(imp_data_c_x, lm(y ~ x_obs)) |>
mod_imp_c_x pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_c_x
estimate conf.low conf.high
1 1.026666 0.9042858 1.149046
Great! This was the right model, so we would expect this to perform well.
Now what happens if we adjust for c_x
in addition in the outcome model:
<- with(imp_data_c_x, lm(y ~ x_obs + c_x)) |>
mod_double_c_x pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_c_x
estimate conf.low conf.high
1 0.9991868 0.7968509 1.201523
The right imputation model with the wrong outcome model is fine!
We only have c_y
Ok first let’s look at the whole dataset.
<- lm(y ~ x + c_y, data = data) |>
mod_full_c_y tidy(conf.int = TRUE) |>
filter(term == "x") |>
select(estimate, conf.low, conf.high)
mod_full_c_y
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 1.01 0.946 1.08
Looks good! Now let’s do the “complete cases” analysis.
<- lm(y ~ x + c_y, data = data_cc) |>
mod_cc_c_y tidy(conf.int = TRUE) |>
filter(term == "x") |>
select(estimate, conf.low, conf.high)
mod_cc_c_y
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 1.01 0.909 1.11
Great! It works. This shows that as long as we have the right outcome model we can do complete case analysis even if the data is missing not at random (cool!). Now let’s do some imputation.
<- mice(
imp_data_c_y
data, m = 5,
method = "norm.predict",
formulas = list(x_obs ~ c_y),
print = FALSE)
Ok let’s compare how this model does “alone”.
<- with(imp_data_c_y,lm(y ~ x_obs)) |>
mod_imp_c_y pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_c_y
estimate conf.low conf.high
1 0.6888255 0.527796 0.849855
Oh no, very bad! The wrong imputation model is worse than complete case! By a lot! This estimate is off by 0.31. Does conditioning on c_y
help us at all?
<- with(imp_data_c_y, lm(y ~ x_obs + c_y)) |>
mod_double_c_y pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_c_y
estimate conf.low conf.high
1 1.009655 0.8887281 1.130582
Phew, the wrong imputation model with the wrong outcome model is back to being fine.
What if both are wrong?
Ok, what if we just had our useless variable, noise
.
<- lm(y ~ x + noise, data = data) |>
mod_full_noise tidy(conf.int = TRUE) |>
filter(term == "x") |>
select(estimate, conf.low, conf.high)
mod_full_noise
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 0.992 0.898 1.09
This is fine! c_x
and c_y
aren’t confoudners so we can estimate the coefficent for x
without them – noise
doesn’t do anything, but it also doesn’t hurt. What about complete case?
<- lm(y ~ x + noise, data = data_cc) |>
mod_cc_noise tidy(conf.int = TRUE) |>
filter(term == "x") |>
select(estimate, conf.low, conf.high)
mod_cc_noise
# A tibble: 1 × 3
estimate conf.low conf.high
<dbl> <dbl> <dbl>
1 0.887 0.748 1.03
Oops! We’ve got bias (as expected!) – we end up with a biased estimate by ~0.11.
What if we build the (wrong) imputation model?
<- mice(
imp_data_noise
data, m = 5,
method = "norm.predict",
formulas = list(x_obs ~ noise),
print = FALSE)
Ok let’s compare how this model does “alone”.
<- with(imp_data_noise, lm(y ~ x_obs)) |>
mod_imp_noise pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_noise
estimate conf.low conf.high
1 0.8807755 0.7217368 1.039814
This is also wrong (womp womp!) What if we try two wrong models?
<- with(imp_data_noise,lm(y ~ x_obs + noise)) |>
mod_double_noise pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_noise
estimate conf.low conf.high
1 0.8865363 0.7275635 1.045509
Nope 😢. Two wrong models here are not better than one! It’s worse! Womp womp.
Let’s put these all together:
bind_rows(
mod_full_c_x,
mod_full_c_y,
mod_full_noise,
mod_cc_c_x,
mod_cc_c_y,
mod_cc_noise,
mod_imp_c_x,
mod_imp_c_y,
mod_imp_noise,
mod_double_c_x,
mod_double_c_y,
mod_double_noise|>
) mutate(
mod = factor(c("Full data with c_x",
"Full data with c_y",
"Full data with noise",
"Complete case with c_x",
"Complete case wtih c_y",
"Complete case with noise",
"Imputation with c_x",
"Imputation with c_y",
"Imputation with noise",
"Two models with c_x",
"Two models with c_y",
"Two models with noise" ),
levels = c("Full data with c_x",
"Complete case with c_x",
"Imputation with c_x",
"Two models with c_x",
"Full data with c_y",
"Complete case wtih c_y",
"Imputation with c_y",
"Two models with c_y",
"Full data with noise",
"Complete case with noise",
"Imputation with noise",
"Two models with noise" )),
mod = fct_rev(mod),
-> to_plot
)
ggplot(to_plot, aes(x = estimate, xmin = conf.low, xmax = conf.high, y = mod)) +
geom_pointrange() +
geom_vline(xintercept = 1, lty = 2)
So there you have it, two wrong models are rarely better than one.
Addendum!
In writing this post, I found that I was getting biased results when I was correctly specifying my imputation model when using the {mice}
defaults (which is why the code above specifies norm.predict
for the method, forcing it to use linear regression, as the data were generated). I didn’t understand why this is happening until some helpful friends on Twitter explained it (thank you Rebecca, Julian, and Mario. I’ll show you what is happening and then I’ll show a quick explanation. Let’s try to redo the imputation models using the defaults:
<- mice(
imp_default_c_x
data, m = 5,
formulas = list(x_obs ~ c_x),
print = FALSE)
<- with(imp_default_c_x, lm(y ~ x_obs)) |>
mod_imp_c_x pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_c_x
estimate conf.low conf.high
1 0.7353276 0.6194273 0.8512279
Bad!
<- with(imp_default_c_x, lm(y ~ x_obs + c_x)) |>
mod_double_c_x pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_c_x
estimate conf.low conf.high
1 0.4602637 0.30023 0.6202974
Even worse!!
<- mice(
imp_default_c_y
data, m = 5,
formulas = list(x_obs ~ c_y),
print = FALSE)
<- with(imp_default_c_y, lm(y ~ x_obs)) |>
mod_imp_c_y pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_c_y
estimate conf.low conf.high
1 0.3405789 0.1126282 0.5685295
YIKES!
<- with(imp_default_c_y, lm(y ~ x_obs + c_y)) |>
mod_double_c_y pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_c_y
estimate conf.low conf.high
1 0.5128878 0.3216547 0.7041208
Better since we are conditioning on c_y
(but still bad!)
<- mice(
imp_default_noise
data, m = 5,
formulas = list(x_obs ~ noise),
print = FALSE)
<- with(imp_default_noise, lm(y ~ x_obs)) |>
mod_imp_noise pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_noise
estimate conf.low conf.high
1 0.4076967 0.2352155 0.5801779
EEK!
<- with(imp_default_noise,lm(y ~ x_obs + noise)) |>
mod_double_noise pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_noise
estimate conf.low conf.high
1 0.4124791 0.2497624 0.5751959
Just as bad..
Let’s put those in the original plot:
bind_rows(
mod_full_c_x,
mod_full_c_y,
mod_full_noise,
mod_cc_c_x,
mod_cc_c_y,
mod_cc_noise,
mod_imp_c_x,
mod_imp_c_y,
mod_imp_noise,
mod_double_c_x,
mod_double_c_y,
mod_double_noise|>
) mutate(
mod = factor(c("Full data with c_x",
"Full data with c_y",
"Full data with noise",
"Complete case with c_x",
"Complete case wtih c_y",
"Complete case with noise",
"Default Imputation with c_x",
"Default Imputation with c_y",
"Default Imputation with noise",
"Two models with c_x",
"Two models with c_y",
"Two models with noise" ),
levels = c("Full data with c_x",
"Complete case with c_x",
"Default Imputation with c_x",
"Two models with c_x",
"Full data with c_y",
"Complete case wtih c_y",
"Default Imputation with c_y",
"Two models with c_y",
"Full data with noise",
"Complete case with noise",
"Default Imputation with noise",
"Two models with noise" )),
mod = fct_rev(mod),
-> to_plot
)
ggplot(to_plot, aes(x = estimate, xmin = conf.low, xmax = conf.high, y = mod)) +
geom_pointrange() +
geom_vline(xintercept = 1, lty = 2)
AHH! This makes me so scared of imputation!!
Rebecca Andridge’s tweet finally helped me see why this is happening. The way the missing data is generated, larger values of c_x
have a higher probability of missingness, and for particularly high values of c_x
that probability is almost 1.
ggplot(data, aes(x = x, y = y, color = factor(x_miss))) +
geom_point() +
geom_vline(xintercept = 2.31, lty = 2) +
labs(color = "missing")
Take a look at the plot above. We have no non-missing x
values that are greater than 2.3. The way predictive mean matching (the default {mice}
method) works is it finds the observation(s) that have the closest predicted value to the observation that is missing a data point and gives you that non-missing data point’s value. So here, we are essentially truncating our distribution at 2.3, since that is the highest value observed. Any value that would have been higher is going to be necessarily too small instead of the right value (this is different from the linear model method used in the first part of this post, which allows you to extrapolate). This is supposed to be a less biased approach, since it doesn’t allow you to extrapolate beyond the bounds of your observed data, but it can actually induce bias when you have pockets of missingness with no observed x
s (which I would argue might happen frequently!). Here is an example of one of the imputed datasets, notice nothing is above that 2.3 line!
ggplot(complete(imp_default_c_x), aes(x = x_obs, y = y, color = factor(x_miss))) +
geom_point() +
scale_x_continuous(limits = c(-2.8, 3.1)) +
geom_vline(xintercept = 2.31, lty = 2) +
labs(color = "imputed")
What if we use y
in the imputation models
Including y
in the imputation model is definitely recommended, as was hammered home for me by the wonderful Frank Harrell, but I’m not sure this recommendation has permeated through the field yet (although this paper re-iterating this result just came out yesterday so maybe it is!).
Let’s see how that improves our imputation models:
<- mice(
imp_y_c_x
data, m = 5,
formulas = list(x_obs ~ c_x + y),
print = FALSE)
<- with(imp_y_c_x, lm(y ~ x_obs)) |>
mod_imp_c_x pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_c_x
estimate conf.low conf.high
1 1.034138 0.9109341 1.157343
Beautiful!
<- with(imp_y_c_x, lm(y ~ x_obs + c_x)) |>
mod_double_c_x pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_c_x
estimate conf.low conf.high
1 1.08074 0.8772936 1.284186
A bit worse, but not bad!
<- mice(
imp_y_c_y
data, m = 5,
formulas = list(x_obs ~ c_y + y),
print = FALSE)
<- with(imp_y_c_y, lm(y ~ x_obs)) |>
mod_imp_c_y pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_c_y
estimate conf.low conf.high
1 0.9298631 0.8137157 1.046011
Not bad!! A bit biased but way better.
<- with(imp_y_c_y, lm(y ~ x_obs + c_y)) |>
mod_double_c_y pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_c_y
estimate conf.low conf.high
1 1.01812 0.9415964 1.094644
Love it, looks great after conditioning on c_y
<- mice(
imp_y_noise
data, m = 5,
formulas = list(x_obs ~ noise + y),
print = FALSE)
<- with(imp_y_noise, lm(y ~ x_obs)) |>
mod_imp_noise pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_imp_noise
estimate conf.low conf.high
1 0.9671593 0.8580147 1.076304
Oo lala, even does well when we don’t have the right model (this makes sense because we are using y
!)
<- with(imp_y_noise, lm(y ~ x_obs + noise)) |>
mod_double_noise pool() |>
tidy(conf.int = TRUE) |>
filter(term == "x_obs") |>
select(estimate, conf.low, conf.high)
mod_double_noise
estimate conf.low conf.high
1 0.9697639 0.8612499 1.078278
Let’s put those in the original plot:
bind_rows(
mod_full_c_x,
mod_full_c_y,
mod_full_noise,
mod_cc_c_x,
mod_cc_c_y,
mod_cc_noise,
mod_imp_c_x,
mod_imp_c_y,
mod_imp_noise,
mod_double_c_x,
mod_double_c_y,
mod_double_noise|>
) mutate(
mod = factor(c("Full data with c_x",
"Full data with c_y",
"Full data with noise",
"Complete case with c_x",
"Complete case wtih c_y",
"Complete case with noise",
"Imputation with c_x and y",
"Imputation with c_y and y",
"Imputation with noise and y",
"Two models with c_x",
"Two models with c_y",
"Two models with noise" ),
levels = c("Full data with c_x",
"Complete case with c_x",
"Imputation with c_x and y",
"Two models with c_x",
"Full data with c_y",
"Complete case wtih c_y",
"Imputation with c_y and y",
"Two models with c_y",
"Full data with noise",
"Complete case with noise",
"Imputation with noise and y",
"Two models with noise" )),
mod = fct_rev(mod),
-> to_plot
)
ggplot(to_plot, aes(x = estimate, xmin = conf.low, xmax = conf.high, y = mod)) +
geom_pointrange() +
geom_vline(xintercept = 1, lty = 2)
Pretty good! Maybe I’m feeling a little better about imputation. Maybe. But it had better include the outcome (which I’ll admit feels very weird for my little causal brain).